Note: This exercise is adapted from the original here. As of September 2020, installing pandas_profiling from conda may give you an old version (1.4.1), because some conda channels lag behind the latest release on PyPI (2.9.0 as of September 2020). To be clear about what was used, the exact environment and library versions for this exercise are listed in the Pipfile (see pipenv for more details) of this example here.

Pandas Profiling: NASA Meteorites example

Source of data: https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh

The autoreload instruction reloads modules automatically before code execution, which is helpful for the update below.

In [1]:
%load_ext autoreload
%autoreload 2
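
As a rough illustration of what autoreload saves you from doing by hand, the sketch below imports a throwaway module, edits its source on disk, and reloads it explicitly. It uses only the standard library. The module name `scratch_mod_demo` and the helper `demo_reload` are hypothetical, and this is not how IPython implements the extension.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_reload():
    """Import a throwaway module, edit its source, and reload it by hand."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "scratch_mod_demo.py"  # hypothetical module name
        source.write_text("VALUE = 1\n")
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("scratch_mod_demo")
            before = module.VALUE

            # Simulate editing the file on disk while the session keeps running.
            source.write_text("VALUE = 2000\n")
            importlib.invalidate_caches()

            # Without a reload, the already-imported module still holds the old value.
            stale = module.VALUE

            # An explicit reload re-executes the edited source.
            module = importlib.reload(module)
            after = module.VALUE
        finally:
            # Clean up so the demo leaves no trace in the interpreter state.
            sys.path.remove(tmp)
            sys.modules.pop("scratch_mod_demo", None)
    return before, stale, after


print(demo_reload())  # prints (1, 1, 2000)
```

With `%autoreload 2` active, the explicit `importlib.reload` step is what gets taken care of for you before each cell runs.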

Make sure that a recent version of pandas-profiling is installed (this exercise uses 2.9.0).

In [2]:
# uncomment and run below if you need to pip install the pandas-profiling library
#import sys
#!{sys.executable} -m pip install -U pandas-profiling==2.9.0
#!jupyter nbextension enable --py widgetsnbextension

You might want to restart the kernel now.
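
After restarting, one way to confirm which version the kernel actually sees is to query the installed package metadata. This is a minimal sketch using the standard library's `importlib.metadata` (Python 3.8+). The helpers `installed_version` and `version_tuple` are hypothetical names introduced here, and the version parsing only looks at leading numeric components.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn '2.9.0' into (2, 9, 0), stopping at the first non-numeric part."""
    parts = []
    for part in version.split("."):
        if not part.isdigit():
            break
        parts.append(int(part))
    return tuple(parts)


found = installed_version("pandas-profiling")
if found is None:
    print("pandas-profiling is not installed in this environment")
elif version_tuple(found) < (2, 9, 0):
    print(f"pandas-profiling {found} is older than 2.9.0; consider upgrading")
else:
    print(f"pandas-profiling {found} is installed")
```

If this reports an old version such as 1.4.1, the conda channel issue described in the note at the top is the likely cause.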

Import libraries

In [4]:
from pathlib import Path

import requests
import numpy as np
import pandas as pd

import pandas_profiling
from pandas_profiling.utils.cache import cache_file

Load and prepare example dataset

We add some synthetic variables to illustrate pandas-profiling's capabilities.

In [5]:
file_name = cache_file(
    "meteorites.csv",
    "https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD",
)
    
df = pd.read_csv(file_name)
    
# Note: pandas timestamps (datetime64[ns]) only cover 1677-09-21 to 2262-04-11;
# errors='coerce' turns out-of-range or unparseable values into NaT, so they are ignored here
df['year'] = pd.to_datetime(df['year'], errors='coerce')

# Example: Constant variable
df['source'] = "NASA"

# Example: Boolean variable
df['boolean'] = np.random.choice([True, False], df.shape[0])

# Example: Mixed with base types
df['mixed'] = np.random.choice([1, "A"], df.shape[0])

# Example: Highly correlated variables
df['reclat_city'] = df['reclat'] + np.random.normal(scale=5,size=(len(df)))

# Example: Duplicate observations
duplicates_to_add = df.iloc[0:10].copy()
duplicates_to_add['name'] = duplicates_to_add['name'] + " copy"

# DataFrame.append is deprecated in newer pandas releases; pd.concat is the equivalent
df = pd.concat([df, duplicates_to_add], ignore_index=True)
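
To see what `errors='coerce'` does in the `year` conversion above, here is a small sketch on a toy series. The three strings are made up for illustration, and the explicit `format` argument is an assumption about how the dates are written; the cell above lets pandas infer the format instead.

```python
import pandas as pd

# Toy stand-in for the meteorite 'year' column; these values are made up.
toy_year = pd.Series(
    ["01/01/1880 12:00:00 AM", "01/01/2006 12:00:00 AM", "not a date"]
)

# errors='coerce' converts anything that cannot be parsed into NaT instead of raising.
parsed = pd.to_datetime(
    toy_year, format="%m/%d/%Y %I:%M:%S %p", errors="coerce"
)

print(parsed.isna().tolist())  # prints [False, False, True]

# The representable range of a pandas Timestamp, for reference.
print(pd.Timestamp.min, pd.Timestamp.max)
```

Rows that end up as NaT are reported as missing values for `year` in the profile report rather than causing the load to fail.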

Inline report without saving object

In [7]:
report = df.profile_report(sort='ascending', html={'style':{'full_width': True}}, progress_bar=False)
report

Save report to file

In [8]:
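profile_report = df.profile_report(html={'style': {'full_width': True}})
# Create the output directory first, in case it does not exist yet
Path("tmp").mkdir(parents=True, exist_ok=True)
profile_report.to_file("tmp/example.html")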

More analysis (Unicode) and printing an existing ProfileReport object inline

In [9]:
profile_report = df.profile_report(explorative=True, html={'style': {'full_width': True}})
profile_report

Notebook Widgets

In [10]:
profile_report.to_widgets()